The choice between RDBMS and NoSQL might be approached from a risk perspective.
- Here's an example of choosing between RDS (a managed relational database) and DynamoDB (a fully managed key-value store), but it might be generalized to RDBMS vs. NoSQL
- It might be that the decision is a [[Cynefin framework|complicated problem]], so one might try to discover the known unknowns and proceed from there. One approach to the known unknowns might be to address [[Map of risks|the risks]]. This is an attempt to do so.
## Design principles
Dynamo [was designed](https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf) to handle high scale, high throughput, low latency, and key-value access. A Dynamo database consists of a table partitioned by a primary key. The RDBMSs offered by RDS [were designed](https://www.goodreads.com/book/show/23463279-designing-data-intensive-applications) to provide transactional OLTP operations on structured data.
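To make the access-model difference concrete, here is a minimal sketch, with a plain Python dict standing in for a Dynamo table and stdlib `sqlite3` standing in for RDS; the table and key names are invented for illustration:

```python
import sqlite3

# Key-value access (Dynamo-style): the table is a map from a primary
# key to an item; there is essentially one way to fetch the data.
sessions = {}
sessions["user#42"] = {"pk": "user#42", "cart": ["sku-1", "sku-2"]}
item = sessions["user#42"]  # O(1) lookup by primary key only

# Relational access (RDS-style): structured rows, ad-hoc queries,
# transactional writes.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sessions (user_id TEXT PRIMARY KEY, cart TEXT)")
with db:  # a transaction: commits on success, rolls back on error
    db.execute("INSERT INTO sessions VALUES (?, ?)", ("user#42", "sku-1,sku-2"))
row = db.execute(
    "SELECT cart FROM sessions WHERE user_id = ?", ("user#42",)
).fetchone()
```

The dict can only answer "give me the item for this key"; the SQL table can answer any query over its columns, at the cost of more machinery underneath.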
## Some risks
#### feasibility (and modification) risk
- **Query patterns.** The [advice](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-general-nosql-design.html#bp-general-nosql-design-vs-relational) [for Dynamo](https://www.youtube.com/watch?v=HaEPXoXVf2k) goes: "You shouldn't start designing your schema for DynamoDB until you know the questions it will need to answer." Dynamo seems to require the query access patterns to be known beforehand.
- **Boring technologies.** [[The value of boring technologies]], so the currently used technologies and the team's experience seem like important factors to consider
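As an illustration of designing for known query patterns, a common Dynamo technique is single-table design: encode each known query into composite partition/sort keys. A hypothetical sketch (the entity types and key formats below are made up, not from any real schema):

```python
# Hypothetical single-table design: all entities share one table, and
# keys are built so each known access pattern becomes either a single
# key lookup or a sort-key prefix query.

def customer_key(customer_id: str) -> dict:
    """Key for the access pattern: 'fetch a customer profile'."""
    return {"PK": f"CUSTOMER#{customer_id}", "SK": "PROFILE"}

def order_key(customer_id: str, order_date: str, order_id: str) -> dict:
    """Key for the access pattern: 'orders for a customer by date'.

    The sort key starts with the date, so 'orders in January 2021'
    becomes a begins_with('ORDER#2021-01') query on the sort key.
    """
    return {"PK": f"CUSTOMER#{customer_id}", "SK": f"ORDER#{order_date}#{order_id}"}
```

A query not anticipated in the key design (say, "all orders above $100") has no efficient path, which is why the access patterns need to be known up front.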
#### quality risk
- **Database migrations.** RDS supports database [[DB migrations seem to simplify SQL schema management quite a lot|migrations]] with mature, battle-tested tools. Migrations allow you to version-control the database schema and evolve the data structures. Migrating data in Dynamo is currently left to the developer (at least I haven't found an out-of-the-box, battle-tested tool that provides data migrations for Dynamo). So if the shape of the data is expected to evolve, there's a beaten path on the RDS turf.
- **Data validation.** Dynamo doesn't seem to support data validation out-of-the-box; validation is left to the application. Dynamo supports a limited set of types, namely booleans, numbers, and strings. RDS comes with richer types, such as enums and timestamps. RDS also supports validation checks on data entry via [constraints](https://www.postgresql.org/docs/current/ddl-constraints.html). In my experience, loose constraints–[exemplified by the 57 ways Philadelphia is spelled in PPP loans](https://twitter.com/dataeditor/status/1280278987797942272)–are one of the root causes of bugs and confusion in business.
- See: [[Strict value checks might help to fail-fast and prevent unexpected state later]]
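A toy sketch of what versioned migrations plus constraints give you on the relational side, using stdlib `sqlite3` to stand in for RDS (the migration list and the `loans` schema are invented for illustration):

```python
import sqlite3

# Invented schema. Each migration runs exactly once, in order, so the
# schema lives in version control next to the application code.
MIGRATIONS = [
    # A CHECK constraint rejects bad data at write time, instead of
    # relying on every code path to validate it.
    "CREATE TABLE loans (id INTEGER PRIMARY KEY, "
    "city TEXT NOT NULL CHECK (length(city) > 0))",
    "CREATE INDEX loans_by_city ON loans (city)",
]

def migrate(db: sqlite3.Connection) -> None:
    # sqlite's user_version pragma tracks how many migrations ran;
    # real tools keep a migrations table instead.
    applied = db.execute("PRAGMA user_version").fetchone()[0]
    for statement in MIGRATIONS[applied:]:
        db.execute(statement)
        applied += 1
    db.execute(f"PRAGMA user_version = {applied}")
```

Running `migrate` twice is a no-op, and inserting a row with an empty city raises an `IntegrityError` instead of silently storing yet another spelling.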
#### modification risk
- **Data lifespan.** Data can be short-lived and temporary, such as customer session data valid for seven days. Or data can be long-lived and persistent, such as customer details. Changing the data schema in Dynamo can be done by supporting two code paths within the application. For short-lived data, this could be fine; for long-lived data, this could become a nightmare. In contrast, changing the data schema in RDS can be done with migrations and a single code path. But at massive scale, changing the schema in RDS could be costly.
- **Company/product maturity.** If a company is in the early stage or experimenting with product-market fit, data access patterns might still be in flux. For this company, RDS offers the flexibility to explore the problem space. On the other hand, a mature company struggling with scaling issues is in a different position, and Dynamo could offer a better fit.
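A sketch of the "two code paths" approach to evolving a schema in a key-value store (the item shapes and the `schema_version` field are hypothetical):

```python
# Hypothetical item shapes: v1 items stored a single "name" field,
# v2 items split it. Without migrations, every reader must handle
# both versions for as long as any v1 item remains in the table.

def read_customer(item: dict) -> dict:
    version = item.get("schema_version", 1)
    if version == 1:
        # Legacy code path: derive the new fields from the old shape.
        first, _, last = item["name"].partition(" ")
        return {"first_name": first, "last_name": last}
    # Current code path: fields already in the new shape.
    return {"first_name": item["first_name"], "last_name": item["last_name"]}
```

For session data that expires in seven days, the legacy branch can be deleted in a week; for customer records that live for years, it lingers, and each further schema change adds another branch.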
#### performance (and modification) risk
- **Joins.** Dynamo doesn't seem to support joins out-of-the-box; joins need to be [implemented by denormalizing data](https://www.youtube.com/watch?v=HaEPXoXVf2k). RDS supports joins out-of-the-box. The downside of RDS is that joins could become expensive as the dataset scales.
- **Scalability & latency.** Dynamo offers high scalability out-of-the-box and single-digit millisecond latency, easily supporting 100k reads per second. RDS can be scaled up and out, but the scalability of Dynamo is out of reach for RDS.
- **Views & aggregations.** RDS supports views and materialized views out-of-the-box. However, when data scales, refreshing a materialized view can be costly. In Dynamo, views need to be implemented on the application side, for example, via [Dynamo Streams](https://aws.amazon.com/blogs/database/dynamodb-streams-use-cases-and-design-patterns/). In RDS, creating and modifying views is easy; in Dynamo, creating and modifying views is costly. Once a view is created in Dynamo, the cost of refreshing is negligible.
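A toy illustration of the join trade-off (the data is invented): relationally you normalize and join at read time, while in a key-value model you pre-join at write time by duplicating data into the item.

```python
import sqlite3

# Relational: normalized tables, join at read time.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada');
    INSERT INTO orders VALUES (10, 1, 9.5);
""")
joined = db.execute(
    "SELECT c.name, o.total FROM orders o JOIN customers c ON c.id = o.customer_id"
).fetchall()

# Key-value: the customer name is duplicated into each order item at
# write time, so the read is a single key lookup. The flip side:
# renaming the customer now means rewriting every one of their orders.
orders = {"ORDER#10": {"customer_name": "Ada", "total": 9.5}}
denormalized = orders["ORDER#10"]
```

The relational read stays correct if the customer is renamed; the denormalized read is faster but pushes that consistency work onto the writer.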
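And a sketch of maintaining a view on the application side, stream-style (the handlers and event shape are made up; real DynamoDB Streams records carry `NewImage`/`OldImage` structures instead):

```python
# Application-side "materialized view": a count of orders per customer,
# kept current by applying change events in arrival order, as a stream
# consumer would.
view = {}

def on_order_inserted(event: dict) -> None:
    # Hypothetical event shape with a bare customer_id field.
    customer = event["customer_id"]
    view[customer] = view.get(customer, 0) + 1

def on_order_removed(event: dict) -> None:
    customer = event["customer_id"]
    view[customer] = view.get(customer, 0) - 1
```

Keeping the view fresh is cheap once the handlers exist; changing *what* the view aggregates means writing new handlers and backfilling from the table, which is where the "modifying views is costly" point above comes from.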
#### availability risk
- **Security, backups, maintenance.** Both services seem to offer backups, point-in-time recovery, and data encryption. Both services are managed: Dynamo provisioning is no-ops; RDS might require tweaking as data scales.
#### business viability risk
- **Analytics, OLAP & Data Warehouses.** In my experience, integrating a BI tool with RDS seems straightforward; integrating a BI tool with Dynamo might turn out to be tricky. However, a BI tool can be connected to a data warehouse, such as AWS Redshift or Google BigQuery. In this case, there are [ETL SaaS solutions](https://fivetran.com/) that load the data from both RDS and Dynamo into the data warehouse.
## Some thoughts
- Of the solutions I've worked on, most didn't require _massive_ scaling or _massive_ throughput. Most of the data for these systems would fit [in a single unoptimized machine](https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html).
- However, I have witnessed another problem extensively: loose data constraints [[strict assertion mode|leading to poor data quality]]
- RDS seems to be the beaten, boring path
- An interesting question to ponder might be whether the decision is [[Cynefin framework|a complex or a complicated problem]]
> Data models are perhaps the most important part of developing software, because they have such a profound effect: not only on how the software is written, but also on how we _think about the problem_ that we are solving.
>
> —Martin Kleppmann, [Designing Data-Intensive Applications](https://www.goodreads.com/book/show/23463279-designing-data-intensive-applications?from_search=true&from_srp=true&qid=6kFvKCyiVg&rank=1)